Variability, Validity and Operator Reliability of Three Ultrasound Systems for Measuring Tissue Stiffness: A Phantom Study

Introduction Ultrasound elastography is a method of measuring soft tissue stiffness to detect the presence of pathology. There are several ultrasound elastography devices on the market. The aim of this study was twofold. Firstly, to determine the validity of three different ultrasound systems used to measure tissue stiffness. Secondly, to determine the operator reliability and repeatability when using these three systems. Materials and methods Two observers undertook multiple stiffness measurements from a phantom model using three different ultrasound systems; the LOGIQ E9, the Aixplorer, and the Acuson S2000. The phantom model had four cylindrical-shaped inclusions (Type 1-4) of increasing stiffness values and diameter embedded within. The background phantom stiffness was fixed. The mean, standard deviation, and coefficient of variation (CV) were calculated from measured stiffness readings per diameter per inclusion. Intra-observer variability was assessed. The validity of the measured stiffness value was assessed by calculating the difference between the measured elasticities and actual phantom elasticities. Bland-Altman plots with limits of agreement were used to display the inter-observer agreement. The intraclass correlation coefficients (ICC) were used to measure intra-observer, inter-observer, and inter-system reliability. Results Each observer undertook 1020 measurements. All three systems generally underestimated the stiffness values for the inclusions; the higher the actual stiffness value, the more significant the underestimation. The percentage difference between measured stiffness and actual stiffness varied from -79.1% to 12.7%. The intra-observer variability was generally less than 5% for observers using the LOGIQ E9 and the Aixplorer systems but more than 10% over the stiffer inclusions (Types 3 and 4) for the Acuson system. There was 'almost perfect' intra-observer reliability and repeatability for both the LOGIQ E9 and the Aixplorer systems; this was 'moderate' for the Acuson system over specific inclusions. For all systems, there was 'almost perfect' inter-observer reliability and repeatability between Observer A and Observer B. The inter-system reliability and repeatability were 'almost perfect' between the LOGIQ E9 system and the Aixplorer system but 'poor' and 'moderate' when the Acuson system was matched with the LOGIQ E9 system and the Aixplorer system, respectively. Conclusion This study has demonstrated that the Acuson, LOGIQ E9, and Aixplorer ultrasound systems have low variability, high reproducibility, and good intra-observer and inter-observer reliability when used to measure tissue stiffness. However, they all underestimated the stiffness values during this in vitro study. This study also revealed that not all ultrasound systems are comparable when measuring tissue stiffness, with some having better inter-system reliability than others. Ongoing standardization of technology is required at the manufacturer level.


Introduction
Ultrasound elastography or shear wave elastography (SWE) is an ultrasound-applied imaging technique used to determine tissue stiffness due to underlying disease. This non-invasive technology allows healthcare professionals to observe and quantify signs of abnormal tissue stiffness in deeper structures [1]. Tissue fibrosis and tumors can lead to variations in tissue stiffness compared to the normal elasticity of the surrounding tissue [2]. Ultrasound elastography is inexpensive, easily accessible, portable, and clinically versatile, and its use continues to aid in early diagnosis and prompt treatment of fibrosis and tumors [2]. SWE technology employs the use of stimulated shear waves to determine Young's modulus (stress to strain ratio) of tissues in kilopascals (kPa): the higher the value, the higher the stiffness [2][3][4]. An ultrasound device sends impulses to the underlying tissue of interest, generating shear waves. Shear waves travel perpendicular to the original wave and are picked up by the ultrasound system software to produce an elasticity map of tissue stiffness [2][3][4]. Three types of shear wave imaging technology used in clinical applications are point SWE, one-dimensional transient SWE, and two-dimensional SWE [2][3][4]. SWE differs from strain (compressive) elastography, where stress is applied by repeated manual compression of the ultrasound transducer. The amount of lesion deformation relative to the surrounding normal tissue is measured and displayed in color [5].
There are several ultrasound systems on the market. Much of the technology and mechanisms behind SWE are patented and, therefore, inaccessible to researchers and clinicians [3]. However, only a few studies have explored the inter-system accuracy and reliability of different ultrasound systems [3,6]. Additionally, while previous studies have investigated intra-and inter-observer reliability, agreement, and validity [7][8][9], only some have explored the disparities in reliability and variability of measurements about the size of the specific region of interest (ROI). This data is essential for healthcare providers when deciding which ultrasound system to purchase for measuring tissue stiffness.
Therefore, the aim of this study was twofold. Firstly, to determine the validity of three different ultrasound systems by comparing measured and actual stiffness values. Secondly, to determine operator reliability by calculating the ultrasound systems' intra-observer, inter-observer, and inter-system reliability.

The ultrasound systems
The three ultrasound systems used for this study were the Acuson S2000 (Siemens, Germany), the Aixplorer (SuperSonic Imagine, France), and the LOGIQ E9 (General Electric, UK). The Aixplorer and LOGIQ E9 systems use two-dimensional SWE technology to measure kilopascals' stiffness (kPa). The Acuson S2000 system uses point SWE technology and measures the shear wave velocity in meters per second (m/s): a conversion factor was used to change this to kPa. The manufacturer-recommended brightness mode (Bmode) and elastography settings were used to ensure readings' consistency and methodology standardization. All three systems had similar linear probes with similar frequency bandwidths to allow comparison between the systems.

The phantom
A phantom model was used to represent human tissue to obtain stiffness values. The Elasticity QA phantom (Model 049, Zerdine®) was loaned from Computerized Imaging Reference Systems Company (CIRS, Virginia, USA) for this study (Figure 1). This phantom contains four cylindrical inclusions of various stiffness values (Type 1: 8 kPa; Type 2: 14 kPa, Type 3: 45 kPa, and Type 4: 80 kPa) at a depth 3.5 cm and four identical cylindrical inclusions at a depth of 6.0 cm. Each cylindrical inclusion is further divided into six partitions of decreasing diameters (16.7, 10.4, 6.5, 4.1, 2.5, and 1.6 mm). The background stiffness value of the phantom was 25 kPa. The five largest partitions of the cylindrical inclusion were included in this study as the smallest diameter proved challenging to quantify and visualize in B-mode imaging [4]. The ultrasound probes could not penetrate the inclusions at 6.0 cm depth, so only the inclusions at 3.5 cm depth were used in this study.

Study design
Observers A and B, blinded to each other's readings, undertook a set of measurements on the elastography phantoms using a single linear probe. Measurements were done on five partitioned sites on each of the four inclusions, totaling 20 sites. Each observer was instructed to do 20 repeat measurements per site, equating to 400 readings per ultrasound system used. Therefore, 1200 readings per observer were expected after using the three systems. In order to standardize the method, the ROI for each diameter was predetermined and remained the same for the Aixplorer and LOGIQ E9 ultrasound systems. However, a rectangular fixed ROI was used for the Acuson S2000 system. Each reading was taken after two frame rates after the probe was positioned, and the elastography function was selected and recorded.

Statistical analysis
Statistical analyses were performed using IBM SPSS Statistics for Windows version 22.0 (IBM Corp., Armonk, NY, USA). Each set of 20 measured stiffness readings attained per diameter per cylindrical inclusion was averaged to calculate the mean, standard deviation (SD), and coefficient of variation (CV). Intra-observer variability was assessed by the coefficient of variation (CV). The CV was determined by dividing the standard deviation by the mean value: the lower the CV, the lower the variability. A CV less than 0.1 (10%) was considered low variability. The validity of the measured stiffness value was assessed by calculating the mean difference (and percentage mean difference) between the measured stiffness values and actual phantom stiffness values. Bland-Altman plots with limits of agreement (95%, 1.96 SD) were used to display interobserver agreement and reliability and the difference between observers against the mean to display systematic variation [10]. Intraclass correlation coefficients (ICC) were used to measure intra-observer, inter-observer, and inter-system reliability/agreement. An ICC reading of 0.00-0.20 indicated 'poor' reliability, 0.21-0.41 indicated 'fair' reliability, 0.41-0.60 indicated 'moderate' reliability, 0.61-0.80 was interpreted as 'substantial' reliability, and a reading greater than 0.80 indicated 'almost perfect' reliability or agreement [11].

Results
A total of 2080 stiffness measurements were taken for this study, 1040 per observer. Each observer took 400 measurements of the phantom using the LOGIQ E9 system, 400 using the Aixplorer system, and 240 using the Acuson S2000 system. The Acuson system could not obtain stiffness measurements from the two smallest diameters (4.1 mm and 2.5 mm), so these were omitted. The Acuson system could not obtain measurements accurately at the 3.0 cm depth, the standardized parameter across the three systems. Therefore, a depth of 2.7 cm (reaching the three largest diameters) was used for the Acuson system. Table 1 reports the mean measured stiffness readings, a difference from actual values, and coefficient of variation (CV) values for each inclusion type for each of the three ultrasound systems per observer. A more extensive table (Table 5) displaying the values for each ROI for each inclusion can be found in the appendix. Both observers underestimated the elasticity values across all inclusions and all systems, except for the two smallest diameters on inclusion Type 1 (8 kPa), which were only measured using the Aixplorer and LOGIQ E9 systems. Observer A reported higher stiffness readings than Observer B throughout the study.

Intra-observer variability, validity, and accuracy
The intra-observer variability was generally low for both observers, indicated by a CV range of 0.01-0.05. The lowest variability was noted while using the Aixplorer ultrasound system, and the highest (CV more than 0.1) was noted while using the Acuson system, especially when measuring over the Type 3 and Type 4 inclusions.
Both observers reported an extensive range in the mean differences across all systems, indicating significant measurement biases and low validity and accuracy. The percentage difference between measured stiffness and actual stiffness varied from -79.1% to 12.7%. When using the LOGIQ E9 and Aixplorer systems to measure the Type 1 inclusion, observers overestimated the stiffness values. Stiffness values were underestimated using all three systems over all other ROI; the higher the actual stiffness value of the inclusion, the greater the underestimation. The Acuson system underestimated the stiffness value at every ROI and had the largest percentage differences between measured and actual stiffness values as the inclusion stiffness increased from Type 1 to Type 4.

Inter-observer agreement
Inter-observer agreement was evaluated using Bland Altman plots and the 95% limits of agreement. Figure 2 shows the plots for the 20 paired stiffness readings (calculated means), each for the LOGIQ E9 and the Aixplorer systems and 12 paired readings for the Acuson system. The plots show the variation and agreement of the mean of the measurements between Observer A and Observer B. The greater the number of plots that fell within the 95% limits of agreement, the greater the inter-observer agreement. For the LOGIQ E9 system, the mean difference in stiffness values between both observers was 0.50 (-2.43 -3.45) kPa, and 18 out of 20 (90%) plots fell within the 95% limits of agreement, indicating good agreement between both observers. For the Aixplorer system, the mean difference in stiffness values between both observers was 0.84 (-2.20 -3.88) kPa, and 18 out of 20 (90%) plots fell within the 95% limits of agreement, also indicating good agreement between both observers. For the Acuson system, the mean difference in stiffness values between both observers was 0.14 (-3.54 -3.26) kPa, and 11 out of 12 (91.7%) plots fell within the 95% limits of agreement, indicating good agreement between both observers. The mean difference for the Acuson system was the closest to zero yet had the greatest variation between the limits of agreement.

FIGURE 2: Bland Altman plots for inter-observer agreement and limits of agreement
For each graph, the Y-axis is the difference between paired stiffness measurements from Observer A and Observer B, while the X-axis is the mean of the paired stiffness measurements from Observer A and Observer B.

Intra-observer reliability
ICC was calculated to determine intra-observer reliability.  The ICC values are reported for both Observer A and Observer B for measurements over each inclusion for each ultrasound system. The lower and upper 95% confidence intervals are also reported.  The ICC values are reported for both Observer A and Observer B for each ultrasound system. The lower and upper 95% confidence intervals are also reported.

Inter-system reliability
The ICC was used to determine inter-system reliability and repeatability.  The ICC values are reported for each ultrasound system compared against one another for both Observers A and B. The lower and upper 95% confidence intervals are also reported.

Discussion
Ultrasound elastography has been widely introduced into the clinical environment. Studies and clinical use have extended to carotid lesions, breast lesions, pancreatic cysts, myocardial stiffness, liver lesions, testicular torsion, and malignant thyroid nodules, to name a few [12][13][14][15][16][17][18]. Thus, it is important to investigate and compare the performance of elastography systems that are commercially available for clinical use.
This study aimed to investigate the validity and operator validity across three different ultrasound systems. Two independent observers (Observer A and Observer B) measured stiffness/stiffness levels of four different inclusions in a single phantom model using the LOGIQ E9 system, the Aixplorer system, and the Acuson system. Within the phantom were four inclusions (with increasing stiffness, Type 1-4); each inclusion had five partitions with increasing diameter. The LOGIQ E9 and the Aixplorer systems slightly overestimated the stiffness values for the softest inclusion (Type 1) at its smallest diameter; the Acuson system failed to get measurements over the two smallest diameter sections of all inclusions. All three systems underestimated the stiffness values for the inclusions at all other ROIs when used by both observers; the higher the actual stiffness value of the inclusion, the greater the underestimation. The Acuson system underestimated the stiffness value by the greatest degree, by up to 80% for the stiffest (Type 4) inclusion. Observer A consistently had slightly higher stiffness values than Observer B for each measurement. The Acuson system could not get a reading for the two smallest diameters on the softest inclusion (Type 1). The intra-observer variability was generally low for both observers, especially when using the LOGIQ E9 system and the Aixplorer system, with CV less than 5%. At the same time, the variability was higher with the Acuson system (CV more than 10% when measuring over the stiffer Type 3 and Type 4 inclusions. There was a good interobserver agreement when using all three systems. There was 'almost perfect' intra-observer reliability and repeatability for all measurements using the LOGIQ E9 and the Aixplorer systems. However, this dropped to 'moderate' reliability for the Acuson system when used over the Type 2 inclusion by Observer A and the Type 4 inclusion by Observer B. For all systems, there was 'almost perfect' inter-observer reliability and repeatability between Observer A and Observer B. The inter-system reliability and repeatability were 'almost perfect' between the LOGIQ E9 system and the Aixplorer system. However, the inter-system reliability was 'poor' and 'moderate' when the Acuson system was matched with the LOGIQ E9 and the Aixplorer systems, respectively. The stiffness of the phantom was fixed at 25 kPa. It is possible that the ultrasound systems were unable to differentiate the background stiffness from the stiffness of the smallest diameter regions of the softest inclusion (Type 1), thus overestimating the stiffness value of the softest inclusion. A similar explanation would account for the marked underestimation of stiffness values for all the stiffer inclusions, especially those with actual stiffness values greater than the background (phantom stiffness). Previous research has questioned the validity of the phantoms being used for such studies [7].
The stiffness readings from the Acuson system were the least accurate of the three systems, with the greatest underestimation and highest variability. These inaccuracies have been highlighted in other studies using this system [4,6,9,19]. The Acuson system measured share wave velocity in meters per second (instead of stiffness), and results were converted to kPa before analysis. The conversion formula uses the density of the medium. The manufacturer did not provide the phantom specifications for density, so density was assumed to be 1g/cm3, the density of soft tissue, as this is a tissue-mimicking phantom. As this was an assumption, this may have led to some inaccuracies in the readings provided by the Acuson system. Additionally, the point SWE technology employed by the Acuson system may not be optimized for harder inclusions, thus producing error readings [4,6].
This study demonstrated high intra-observer and inter-observer reliability/agreement. Previous phantom studies have also reported comparable high intra-observer and inter-observer agreement [6,7,20]. There was also a good inter-system agreement between the LOGIQ E9 and the Aixplorer system but reduced agreement with the Acuson system. A previous study found great disparities between measurements obtained using the Aixplorer (Supersonic Imagine) versus the Siemens Acuson S2000, as found in our study. However, that study used a phantom with only one stiffness value [21].
This result has profound clinical importance, as they highlight that not all ultrasound systems are equivalent when measuring tissue stiffness. Ultrasound elastography is widely employed for diagnostic purposes; therefore, treatment, response, and prognosis decisions are based on such platforms. The results may differ if patients move between healthcare institutions employing alternative ultrasound systems; therefore, clinical decision-makers must consider this during clinical practice. Conversion formulae have been suggested to adapt the measurements between ultrasound systems to aid comparability [3].

Study limitations
The probes used in this study could not read at 6 cm depth as originally intended for this study, and therefore, depth analysis was excluded. All probes used were linear in design, but their frequency range was not identical; their frequencies varied from 2-10 MHz, which may have introduced some bias into the stiffness readings. However, the frequencies for the Aixplorer, LOGIQ E9, and the Acuson systems are comparable at 10 MHz, 9 MHz, and 9 MHz, respectively, so the effect of frequency may be less profound. To standardize the method, the ROI for each diameter was predetermined, and the readings were recorded after two frame rates. However, in practice, this was only partially possible. The ROI was easily specified to Aixplorer before readings, but for the LOGIQ E9, the ROI had to be manually circumscribed for every reading. The ROI was rectangular by default on the SA system. Therefore, these manipulations may have added errors or bias to the readings. Tissue-mimicking phantoms do not possess the viscoelastic properties of actual tissue. Therefore the results obtained do not necessarily fully reflect the stiffness readings in vivo in the clinical environment [7].

Further research
The phantom's stiffness may affect the inclusions' stiffness readings. The validity of the phantoms used in studies could be evaluated by employing multiple phantoms and observers and comparing more stiffness readings. Few studies have compared the reliability and accuracy of results obtained at different depths by ultrasound systems in a phantom. Therefore, depth analysis in vivo and in vitro is recommended for future research. More in vivo studies are required to determine actual tissue stiffness at different sites so accurate phantom can be developed for repeatability and variability studies of stiffness reading.

Conclusions
This study has demonstrated that the Acuson, LOGIQ E9, and Aixplorer ultrasound systems have low variability, high reproducibility, and good intra-observer and inter-observer reliability when measuring tissue stiffness. However, they all underestimate the true stiffness values during in vitro studies, requiring a correction formula. This study also revealed that not all ultrasound systems are comparable when used to measure tissue stiffness, with some having better inter-system reliability than others. Therefore, applying the readings from one machine to another in the clinical environment may not be accurate. So ultrasound systems require standardization between manufacturers to propel them into absolute clinical integration. We hope this study not only contributes to the literature but also encourages further research into this exciting topic.